All Mixed Up? Finding the Optimal Feature Set for General Readability Prediction and Its Application to English and Dutch

نویسندگان

  • Orphée De Clercq
  • Véronique Hoste
چکیده

Readability research has a long and rich tradition, but there has been too little focus on general readability prediction without targeting a specific audience or text genre. Moreover, although NLP-inspired research has focused on adding more complex readability features, there is still no consensus on which features contribute most to the prediction. In this article, we investigate in close detail the feasibility of constructing a readability prediction system for English and Dutch generic text using supervised machine learning. Based on readability assessments by both experts and crowdsourcing, we implement different types of text characteristics ranging from easy-tocompute superficial text characteristics to features requiring deep linguistic processing, resulting in ten different feature groups. Both a regression and classification set-up are investigated reflecting the two possible readability prediction tasks: scoring individual texts or comparing two texts. We show that going beyond correlation calculations for readability optimization using a wrapper-based genetic algorithm optimization approach is a promising task that provides considerable insights in which feature combinations contribute to the overall readability prediction. Because we also have gold standard information available for those features requiring deep processing, we are able to investigate the true upper bound of our Dutch system. Interestingly, we will observe that the performance of our fully automatic readability prediction pipeline is on par with the pipeline using gold-standard deep syntactic and semantic information.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Workshop Predicting and Improving Readability

s Scott Crossley Crowdsourcing text complexity models The current study builds on work by De Clercq et al. (2014) and Crossley et al. (2017) by using crowdsourcing techniques to collect human ratings of text comprehension, processing, and familiarity across a large corpus comprising a diverse variety of topic domains (science, technology, and history). Pairwise comparisons among the ratings wer...

متن کامل

A full ranking method using integrated DEA models and its application to modify GA for finding Pareto optimal solution of MOP problem

This paper uses integrated Data Envelopment Analysis (DEA) models to rank all extreme and non-extreme efficient Decision Making Units (DMUs) and then applies integrated DEA ranking method as a criterion to modify Genetic Algorithm (GA) for finding Pareto optimal solutions of a Multi Objective Programming (MOP) problem. The researchers have used ranking method as a shortcut way to modify GA to d...

متن کامل

A General Investigation on the Combination of Local and Global Feature Selection Methods for Request Identification in Telegram

Nowadays, the use of various messaging services is expanding worldwide with the rapid development of Internet technologies. Telegram is a cloud-based open-source text messaging service. According to the US Securities and Exchange Commission and based on the statistics given for October 2019 to present, 300 million people worldwide used telegram per month. Telegram users are more concentrated in...

متن کامل

Neural Network Based Recognition System Integrating Feature Extraction and Classification for English Handwritten

Handwriting recognition has been one of the active and challenging research areas in the field of image processing and pattern recognition. It has numerous applications that includes, reading aid for blind, bank cheques and conversion of any hand written document into structural text form. Neural Network (NN) with its inherent learning ability offers promising solutions for handwritten characte...

متن کامل

Creating Algorithmic Symbols to Enhance Learning English Grammar

This paper introduces a set of English grammar symbols that the author has developed to enhance students’ understanding and consequently, application of the English grammar rules. A pretest-posttest control-group design was carried out in which the samples were students in two girls’ senior high schools (N=135, P ≤ 0.05) divided into two groups: the Treatment which received gramm...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:
  • Computational Linguistics

دوره 42  شماره 

صفحات  -

تاریخ انتشار 2016